Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support async networking #63

Closed
wants to merge 53 commits into from
Closed

Conversation

rnburn
Copy link
Contributor

@rnburn rnburn commented Apr 16, 2020

Adds an interface for managing async file and timer events within the io polling system calls.

Will support asynchronous networking for tracers and meters.

A concrete implementation is provided with libevent.

The dispatcher interface was derived from Envoy.

@pyohannes
Copy link
Contributor

pyohannes commented Apr 17, 2020

Will support asynchronous networking for tracers and meters.

What major use cases for asynchronous networking do you have in mind? Is this intended to be used by exporters?

@rnburn
Copy link
Contributor Author

rnburn commented Apr 17, 2020

Asynchronous networking is more efficient. If the tracer uses synchronous networking, then it could be blocked writing data to a socket even if could move forward with work on other sockets or events.

Exporters can use the interface, but they're also welcome to do their own synchronous networking (though that will block the background thread).

Additionally, asynchronous networking is important for allowing an application to exit quickly. If you do a synchronous write to an endpoint that's not responding and the app exits, you can hang up the app until the write's timeout is exceeded.

@pyohannes
Copy link
Contributor

Asynchronous networking is more efficient. If the tracer uses synchronous networking, then it could be blocked writing data to a socket even if could move forward with work on other sockets or events.

Yes, I completely agree. I mostly wonder where in the SDK it will be used, apart from exporters.

@rnburn
Copy link
Contributor Author

rnburn commented Apr 17, 2020

It could also be used in the sdk to set up the the timers that regularly trigger the span buffer to flush (similarly for metrics)

And asynchronous dns resolution can be supported over an interface like this with c-ares.

}),
make_commands = select({
"@io_opentelemetry_cpp//bazel:windows": ["MSBuild.exe INSTALL.vcxproj"],
"//conditions:default" : None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this file didn't go through buildifier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update format script to run on these files

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #69

ci/install_libevent.sh Show resolved Hide resolved
sdk/include/opentelemetry/sdk/event/dispatcher.h Outdated Show resolved Hide resolved
/**
* Run the event loop until.
*/
virtual void Run() noexcept = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this documented? As a user of the SDK, what do I need to do with the Dispatcher?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's intended to support a design where you have a single background thread and asynchronous networking. The background thread would be making successive calls to the systems multiplexing io function (e.g. epoll, kqueue, etc) and invoking callbacks as events are ready.

If you're a user of the SDK and you wanted to use async networking for your vendor exporter, you'd call CreateFileEvent and pass the file descriptors for the sockets you used to connect to your endpoint.

Or if you don't want to do asynchronous io, you can ignore the dispatcher, use synchronous sockets, and block the background thread when you want to do io (less efficient and can hang the process from exiting but supported).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we capture this in a doc please? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added commenting

sdk/include/opentelemetry/sdk/event/file_event.h Outdated Show resolved Hide resolved
sdk/src/event/libevent/BUILD Outdated Show resolved Hide resolved
t1 = std::chrono::steady_clock::now();
dispatcher.Run();
auto duration = t2 - t1;
EXPECT_TRUE(duration > std::chrono::milliseconds{50});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On different CI systems we might need to adjust the tolerance of timing.
Consider making this a configurable thing which we could leverage in other test cases (in particular metrics and async exporter).

https://github.com/census-instrumentation/opencensus-python/blob/99821a0b826445cecce0881a01790c8cc6958ceb/tests/unit/metrics/test_transport.py#L32

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to pick a constant that should be large enough for most environments, though we can probably go larger without slowing down the tests too much.

Copy link
Member

@reyang reyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@g-easy
Copy link
Contributor

g-easy commented May 4, 2020

Just needs an autoformat to pass CI.

void Dispatcher::Exit() noexcept {}
void Dispatcher::Exit() noexcept
{
running_ = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not meant to be threadsafe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, meant for interacting with only a single background thread.

while (running_ && !events_.empty())
{
auto next_event = events_.begin();
std::this_thread::sleep_until(next_event->first);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can this sleep be interrupted?

Copy link
Contributor Author

@rnburn rnburn May 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend a short polling interval over an interrupt mechanism.

I'd rather people use the libevent implementation, and this is meant to be a drop-in replacement that provides the same interface and allows you to use the same codebase but without an event library (in case you only care about synchronous networking and don't want to add an external dependency).

When the dispatcher is built on top of libevent, the background thread makes successive calls to the systems polling function (e.g. epoll or an equivalent). Epoll blocks waiting for either a timeout or and event on a list of file descriptors.

I believe it is possible to artificially interrupt epoll if you create a dummy file descriptor for it to include in its list and then close the file descriptor, forcing an event.

But I prefer, and I think it's simpler, to use a polling function with a short timer interval that monitors for termination, closure, and flush events. Basically what's done here
https://github.com/lightstep/lightstep-tracer-cpp/blob/master/src/recorder/stream_recorder/stream_recorder_impl.cpp#L57-L82

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sleep_until may hang "forever" in gcc-4.8 and vs2017 15.4 when system clock changes, i.e. after NTP time sync.

Copy link
Contributor Author

@rnburn rnburn May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It uses the steady clock which shouldn't have that problem

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is possible interrupt epoll

According to man epoll_wait can be interrupted by a signal handler;

you create a dummy file descriptor for it to include in its list and then close the file descriptor, forcing an event.

ref: http://cr.yp.to/docs/selfpipe.html

@pyohannes pyohannes requested a review from a team July 6, 2020 23:04
@codecov
Copy link

codecov bot commented Jul 7, 2020

Codecov Report

Merging #63 (cc34479) into main (65f6639) will decrease coverage by 2.90%.
The diff coverage is 54.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
- Coverage   93.61%   90.70%   -2.91%     
==========================================
  Files          71       85      +14     
  Lines        1754     1894     +140     
==========================================
+ Hits         1642     1718      +76     
- Misses        112      176      +64     
Impacted Files Coverage Δ
sdk/include/opentelemetry/sdk/event/file_event.h 0.00% <0.00%> (ø)
sdk/src/event/libevent/file_event.cc 0.00% <0.00%> (ø)
sdk/src/event/libevent/file_event.h 0.00% <0.00%> (ø)
sdk/src/event/libevent/io_dispatcher.cc 50.00% <50.00%> (ø)
sdk/src/event/libevent/event_base.cc 54.54% <54.54%> (ø)
sdk/src/event/libevent/event.cc 65.62% <65.62%> (ø)
sdk/src/event/libevent/timer.cc 76.92% <76.92%> (ø)
sdk/test/event/libevent/io_dispatcher_test.cc 85.71% <85.71%> (ø)
sdk/include/opentelemetry/sdk/event/dispatcher.h 100.00% <100.00%> (ø)
...dk/include/opentelemetry/sdk/event/io_dispatcher.h 100.00% <100.00%> (ø)
... and 4 more

@pyohannes
Copy link
Contributor

Assigning to @lalitb, as discussed in the SIG meeting on October 21st.

The main challenge here would be rebasing and making all GH actions pass.

@alolita
Copy link
Member

alolita commented Dec 7, 2020

Triaged Dec 7 2020: This is a good change to integrate but non-trivial. The maintainers need to figure out what items are blocked on this. This is not an immediate priority. Will keep open until re-evaluated.

@ThomsonTan
Copy link
Contributor

@lalitb do we need rework on this PR to get it merged?

@github-actions github-actions bot added the Stale label Dec 7, 2021
@github-actions github-actions bot closed this Dec 15, 2021
@lalitb lalitb reopened this Dec 15, 2021
@linux-foundation-easycla
Copy link

CLA Not Signed

  • ❌ The commit (6234ee6 ,faf297319f1d0947ae74b719564ed25f1828247e ,1de581b781dc07ffe97650aa5fd9b5ed9d84692c ,2cbf66ca2938df2326465e07999d30b8cc227275 ,8248ddd574bdec2eb7db0fc4f870457cf01230e1 ,708f7b0c483ca2713320f66595626e5403ca5337 ,cc34479373512e7090a706a76fa843af75af8cbe) is missing the User's ID, preventing the EasyCLA check. Consult GitHub Help to resolve.For further assistance with EasyCLA, please submit a support request ticket.

@github-actions github-actions bot closed this Dec 22, 2021
@lalitb
Copy link
Member

lalitb commented Dec 22, 2021

Keeping it open, for someone to take a dig at it.

@lalitb lalitb reopened this Dec 22, 2021
@linux-foundation-easycla
Copy link

CLA Not Signed

  • ❌ The commit (6234ee6 ,faf297319f1d0947ae74b719564ed25f1828247e ,1de581b781dc07ffe97650aa5fd9b5ed9d84692c ,2cbf66ca2938df2326465e07999d30b8cc227275 ,8248ddd574bdec2eb7db0fc4f870457cf01230e1 ,708f7b0c483ca2713320f66595626e5403ca5337 ,cc34479373512e7090a706a76fa843af75af8cbe) is missing the User's ID, preventing the EasyCLA check. Consult GitHub Help to resolve.For further assistance with EasyCLA, please submit a support request ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants